Bokeh is an interactive Python library for visualizations that targets modern web browsers for presentation. Its goal is to provide elegant, concise construction of novel graphics in the style of D3.js, and to extend this capability with high-performance interactivity over very large or streaming datasets. Bokeh can help anyone who would like to quickly and easily create interactive plots, dashboards, and data applications.
The following notebook is intended to illustrate some of Bokeh's interactive utilities and is based on a post by software engineer and Bokeh developer Sarah Bird.
Gapminder started as a spin-off from Professor Hans Rosling’s teaching at the Karolinska Institute in Stockholm. Having encountered broad ignorance about the rapid health improvement in Asia, he wanted to measure that lack of awareness among students and professors. He presented the surprising results from his so-called “Chimpanzee Test” in his first TED-talk in 2006.
Rosling's interactive "Health and Wealth of Nations" visualization has since become an iconic illustration of how our assumptions about ‘first world’ and ‘third world’ countries can betray us. Mike Bostock has recreated the visualization using D3.js, and in this lab, we will see that it is also possible to use Bokeh to recreate the interactive visualization in Python.
Widgets are interactive controls that can be added to Bokeh applications to provide a front end user interface to a visualization. They can drive new computations, update plots, and connect to other programmatic functionality. When used with the Bokeh server, widgets can run arbitrary Python code, enabling complex applications. Widgets can also be used without the Bokeh server in standalone HTML documents through the browser’s Javascript runtime.
To use widgets, you must add them to your document and define their functionality. Widgets can be added directly to the document root or nested inside a layout. There are two ways to program a widget's functionality: you can attach a CustomJS callback, which runs entirely in the browser and therefore works in standalone HTML documents, or you can run your code with bokeh serve to start the Bokeh server and set up event handlers with .on_change (or, for some widgets, .on_click). A minimal sketch of the second approach follows; in this lab we will use the first.
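For reference, here is a minimal sketch of the server-side approach. The names my_slider and update are placeholders for this illustration, and the handler only runs when the document is served with bokeh serve; we will not need it today.
from bokeh.models import Slider

my_slider = Slider(start=0, end=10, value=5, step=1, title="Example")

def update(attr, old, new):
    # Runs in Python on the Bokeh server whenever the slider value changes.
    print("slider moved from {} to {}".format(old, new))

my_slider.on_change('value', update)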
In [15]:
# Science Stack
import numpy as np
import pandas as pd
# Bokeh Essentials
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
# Layouts
from bokeh.layouts import layout
from bokeh.layouts import widgetbox
# Data models for visualization
from bokeh.models import Text
from bokeh.models import Plot
from bokeh.models import Slider
from bokeh.models import Circle
from bokeh.models import Range1d
from bokeh.models import CustomJS
from bokeh.models import HoverTool
from bokeh.models import LinearAxis
from bokeh.models import ColumnDataSource
from bokeh.models import SingleIntervalTicker
# Palettes and colors
from bokeh.palettes import brewer
from bokeh.palettes import Spectral6
To display Bokeh plots inline in a Jupyter notebook, use the output_notebook() function from bokeh.io. When show() is called, the plot will be displayed inline in the next notebook output cell. To save your Bokeh plots, you can use the output_file() function instead (or in addition). The output_file() function will write an HTML file to disk that can be opened in a browser.
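For example, a rough sketch of also saving a plot to disk (the filename is just an illustration, and plot stands in for any Bokeh figure or layout):
from bokeh.io import output_file, show

output_file("gapminder.html")  # subsequent show() calls will also write this HTML file
show(plot)                     # renders the plot and opens gapminder.html in a browser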
In [2]:
# Load Bokeh for visualization
output_notebook()
Some of Bokeh's examples rely on sample data that is not included in the Bokeh GitHub repository or released packages, due to its size. Once Bokeh is installed, the sample data can be obtained by executing the command in the next cell. The location where the sample data is stored can be configured; by default, data is downloaded and stored to the directory $HOME/.bokeh/data (the directory is created if it does not already exist).
In [3]:
import bokeh.sampledata
bokeh.sampledata.download()
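If you'd like to confirm where the files ended up, you can inspect the default directory mentioned above (this assumes you have not configured a different location):
import os

sampledata_dir = os.path.expanduser(os.path.join("~", ".bokeh", "data"))
print(sampledata_dir)
print(sorted(os.listdir(sampledata_dir))[:5])  # peek at a few of the downloaded files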
In order to create an interactive plot in Bokeh, we need to animate snapshots of the data over time from 1964 to 2013. To do this, we can think of each year as a separate static plot and use a JavaScript callback to change the data source that is driving the plot.
Bokeh exposes various callbacks, which can be specified from Python, that trigger actions inside the browser's JavaScript runtime. This kind of JavaScript callback can be used to add interesting interactions to Bokeh documents without the need for a Bokeh server (but can also be used in conjunction with one). Custom callbacks are set by creating a CustomJS object and passing it as the callback argument to a Widget object.
As the data we will be using today is not too big, we can pass all of the datasets to the JavaScript at once and switch between them on the client side using a slider widget.
This means that we need to put all of the datasets together and build a single data source for each year. First we will load each of the datasets with the process_data() function and do a bit of cleanup:
In [5]:
def process_data():
    # Import the Gapminder datasets.
    from bokeh.sampledata.gapminder import fertility, life_expectancy, population, regions

    # The columns are currently string values for each year;
    # make them ints for data processing and visualization.
    columns = list(fertility.columns)
    years = list(range(int(columns[0]), int(columns[-1]) + 1))
    rename_dict = dict(zip(columns, years))

    # Apply the integer year column names to the datasets.
    fertility = fertility.rename(columns=rename_dict)
    life_expectancy = life_expectancy.rename(columns=rename_dict)
    population = population.rename(columns=rename_dict)
    regions = regions.rename(columns=rename_dict)

    # Turn population into bubble sizes. Use min_size and scale_factor to tweak.
    scale_factor = 200
    population_size = np.sqrt(population / np.pi) / scale_factor
    min_size = 3
    population_size = population_size.where(population_size >= min_size).fillna(min_size)

    # Use pandas categories to categorize and color the regions.
    regions.Group = regions.Group.astype('category')
    regions_list = list(regions.Group.cat.categories)

    def get_color(r):
        return Spectral6[regions_list.index(r.Group)]

    regions['region_color'] = regions.apply(get_color, axis=1)

    return fertility, life_expectancy, population_size, regions, years, regions_list
Next we will add each of our sources to the sources dictionary, where each key is the name of the year (prefixed with an underscore) and each value is a ColumnDataSource holding the aggregated values for that year.
Note that we need the prefix because JavaScript identifiers cannot begin with a number.
In [6]:
# Process the data and fetch the data frames and lists.
fertility_df, life_expectancy_df, population_df_size, regions_df, years, regions = process_data()

# Create a data source dictionary whose keys are prefixed years
# and whose values are ColumnDataSource objects that merge the
# various per-year values from each data frame.
sources = {}

# Quick helper variables
region_color = regions_df['region_color']
region_color.name = 'region_color'

# Create a source for each year.
for year in years:
    # Extract the fertility for each country for this year.
    fertility = fertility_df[year]
    fertility.name = 'fertility'

    # Extract life expectancy for each country for this year.
    life = life_expectancy_df[year]
    life.name = 'life'

    # Extract the normalized population size for each country for this year.
    population = population_df_size[year]
    population.name = 'population'

    # Create a dataframe from our extraction and add it to our sources.
    new_df = pd.concat([fertility, life, population, region_color], axis=1)
    sources['_' + str(year)] = ColumnDataSource(new_df)
You can see what's in the sources dictionary by running the cell below.
Later we will pass this sources dictionary to the JavaScript callback. In so doing, we will find that in our JavaScript we have objects named by year, each of which refers to the corresponding ColumnDataSource.
In [7]:
sources
Out[7]:
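Each value is an ordinary ColumnDataSource, so you can inspect the columns for a single year; for example, using the first year:
first_source = sources['_%s' % years[0]]
print(first_source.column_names)            # the per-country columns, e.g. fertility, life, population, region_color
print(len(first_source.data['fertility']))  # one entry per country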
We can also create a corresponding dictionary_of_sources object, where the keys are the integer years and the values are the string names ('_1964', '_1965', ...) that will refer to those ColumnDataSources in the JavaScript:
In [8]:
dictionary_of_sources = dict(zip([x for x in years], ['_%s' % x for x in years]))
In [14]:
js_source_array = str(dictionary_of_sources).replace("'", "")
js_source_array
Out[14]:
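The resulting string is a fragment of JavaScript object-literal syntax rather than valid Python (which is why the quotes were stripped); it should look roughly like {1964: _1964, 1965: _1965, ..., 2013: _2013}, mapping each year to the variable name of its ColumnDataSource.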
In [16]:
# Set up the plot ranges: fertility on x, life expectancy on y.
xdr = Range1d(1, 9)
ydr = Range1d(20, 100)

# Create a bare Plot model; axes, glyphs, and tools are added below.
plot = Plot(
    x_range=xdr,
    y_range=ydr,
    plot_width=800,
    plot_height=400,
    outline_line_color=None,
    toolbar_location=None,
    min_border=20,
)
In order to display the plot in the notebook, use the show() function:
In [19]:
# show(plot)
In [20]:
# Create a dictionary of our common axis settings.
AXIS_FORMATS = dict(
    minor_tick_in=None,
    minor_tick_out=None,
    major_tick_in=None,
    major_label_text_font_size="10pt",
    major_label_text_font_style="normal",
    axis_label_text_font_size="10pt",
    axis_line_color='#AAAAAA',
    major_tick_line_color='#AAAAAA',
    major_label_text_color='#666666',
    major_tick_line_cap="round",
    axis_line_cap="round",
    axis_line_width=1,
    major_tick_line_width=1,
)

# Create two axis models for the x and y axes.
xaxis = LinearAxis(
    ticker=SingleIntervalTicker(interval=1),
    axis_label="Children per woman (total fertility)",
    **AXIS_FORMATS
)
yaxis = LinearAxis(
    ticker=SingleIntervalTicker(interval=20),
    axis_label="Life expectancy at birth (years)",
    **AXIS_FORMATS
)

# Add the axes to the plot in the specified positions.
plot.add_layout(xaxis, 'below')
plot.add_layout(yaxis, 'left')
Go ahead and experiment with visualizing each step of the building process and changing various settings.
In [22]:
# show(plot)
One of the features of Rosling's animation is that the year appears as large background text in the plot. We will add this feature to our plot first so that it is layered below all of the other glyphs (which will be incrementally added, layer by layer, on top of one another until we are finished).
In [23]:
# Create a data source for each of our years to display.
text_source = ColumnDataSource({'year': ['%s' % years[0]]})
# Create a text object model and add to the figure.
text = Text(x=2, y=35, text='year', text_font_size='150pt', text_color='#EEEEEE')
plot.add_glyph(text_source, text)
Out[23]:
In [25]:
# show(plot)
Next we will add the bubbles using Bokeh's Circle glyph. We start with the first year of data, which is the source that drives the circles (the other sources will be used later).
In [26]:
# Select the source for the first year we have.
renderer_source = sources['_%s' % years[0]]

# Create a circle glyph to generate points for the scatter plot.
circle_glyph = Circle(
    x='fertility', y='life', size='population',
    fill_color='region_color', fill_alpha=0.8,
    line_color='#7c7e71', line_width=0.5, line_alpha=0.5
)

# Connect the glyph generator to the data source and add to the plot.
circle_renderer = plot.add_glyph(renderer_source, circle_glyph)
In the above, plot.add_glyph returns the renderer, which we can then pass to the HoverTool so that hovering only happens for the bubbles and not for the other glyph elements:
In [27]:
# Add the hover (only against the circle and not other plot elements)
tooltips = "@index"
plot.add_tools(HoverTool(tooltips=tooltips, renderers=[circle_renderer]))
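The "@index" field refers to the data source's index column, which holds the country names. If you want richer tooltips you can pass a list of (label, field) pairs instead; for example, a hypothetical variant (not used below) that also shows the plotted values:
tooltips = [
    ("Country", "@index"),
    ("Fertility", "@fertility"),
    ("Life expectancy", "@life"),
]
plot.add_tools(HoverTool(tooltips=tooltips, renderers=[circle_renderer]))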
Test out different parameters for the Circle glyph and see how they change the plot:
In [29]:
# show(plot)
In [31]:
# Position of the legend
text_x = 7
text_y = 95

# For each region, add a circle with the color and text.
for i, region in enumerate(regions):
    plot.add_glyph(Text(x=text_x, y=text_y, text=[region], text_font_size='10pt', text_color='#666666'))
    plot.add_glyph(
        Circle(x=text_x - 0.1, y=text_y + 2, fill_color=Spectral6[i], size=10, line_color=None, fill_alpha=0.8)
    )
    # Move the y coordinate down a bit.
    text_y = text_y - 5
In [33]:
# show(plot)
Next we add the slider widget and the JavaScript callback code, which changes the data of the renderer_source (powering the bubbles/circles) and the data of the text_source (powering the background text). After we set() the new data we need to trigger() a change. slider, renderer_source, and text_source are all available inside the callback because we add them as args to the CustomJS object.
It is the combination of sources = %s % (js_source_array) in the JavaScript and CustomJS(args=sources, ...) that provides the ability to look up, by year, the JavaScript version of each Python-made ColumnDataSource.
In [34]:
# Add the slider
code = """
var year = slider.get('value'),
sources = %s,
new_source_data = sources[year].get('data');
renderer_source.set('data', new_source_data);
text_source.set('data', {'year': [String(year)]});
""" % js_source_array
callback = CustomJS(args=sources, code=code)
slider = Slider(start=years[0], end=years[-1], value=years[0], step=1, title="Year", callback=callback)
callback.args["renderer_source"] = renderer_source
callback.args["slider"] = slider
callback.args["text_source"] = text_source
In [37]:
# show(widgetbox(slider))
In [38]:
show(layout([[plot], [slider]], sizing_mode='scale_width'))
I hope that you'll use Bokeh to produce interactive visualizations for your own visual analyses.
In this section we'll take a look at visualizing a corpus by exploring clustering and dimensionality reduction techniques. Text analysis is certainly a high-dimensional problem, and these techniques can be applied to other data sets as well.
The first step is to load our documents from disk and vectorize them using Gensim. This content is a bit beyond the scope of today's workshop; however, I did want to provide the code for reference, and I'm happy to go over it offline.
In [3]:
import nltk
import string
import pickle
import gensim
import random
from operator import itemgetter
from collections import defaultdict
from nltk.corpus import wordnet as wn
from gensim.matutils import sparse2full
from nltk.corpus.reader.api import CorpusReader
from nltk.corpus.reader.api import CategorizedCorpusReader
CORPUS_PATH = "data/baleen_sample"
PKL_PATTERN = r'(?!\.)[a-z_\s]+/[a-f0-9]+\.pickle'
CAT_PATTERN = r'([a-z_\s]+)/.*'
In [4]:
class PickledCorpus(CategorizedCorpusReader, CorpusReader):

    def __init__(self, root, fileids=PKL_PATTERN, cat_pattern=CAT_PATTERN):
        CategorizedCorpusReader.__init__(self, {"cat_pattern": cat_pattern})
        CorpusReader.__init__(self, root, fileids)

        self.punct = set(string.punctuation) | {'“', '—', '’', '”', '…'}
        self.stopwords = set(nltk.corpus.stopwords.words('english'))
        self.wordnet = nltk.WordNetLemmatizer()

    def _resolve(self, fileids, categories):
        if fileids is not None and categories is not None:
            raise ValueError("Specify fileids or categories, not both")

        if categories is not None:
            return self.fileids(categories=categories)

        return fileids

    def lemmatize(self, token, tag):
        token = token.lower()

        if token not in self.stopwords:
            if not all(c in self.punct for c in token):
                tag = {
                    'N': wn.NOUN,
                    'V': wn.VERB,
                    'R': wn.ADV,
                    'J': wn.ADJ
                }.get(tag[0], wn.NOUN)
                return self.wordnet.lemmatize(token, tag)

    def tokenize(self, doc):
        # Expects a preprocessed document; removes stopwords and punctuation,
        # makes all tokens lowercase, and lemmatizes them.
        return list(filter(None, [
            self.lemmatize(token, tag)
            for paragraph in doc
            for sentence in paragraph
            for token, tag in sentence
        ]))

    def docs(self, fileids=None, categories=None):
        # Resolve the fileids and the categories.
        fileids = self._resolve(fileids, categories)

        # Create a generator, loading one document into memory at a time.
        for path, enc, fileid in self.abspaths(fileids, True, True):
            with open(path, 'rb') as f:
                yield self.tokenize(pickle.load(f))
The PickledCorpus is a Python class that reads a continuous stream of pickle files from disk. The files themselves are preprocessed documents from RSS feeds on various topics (and are actually just a small sample of the documents in the larger corpus). If you're interested in the ingestion and curation of this corpus, see baleen.districtdatalabs.com.
Just to get a feel for this data set, I'll load the corpus and print out the number of documents per category:
In [5]:
# Create the Corpus Reader
corpus = PickledCorpus(CORPUS_PATH)
In [6]:
# Count the total number of documents.
total_docs = 0

# Count the number of documents per category.
for category in corpus.categories():
    num_docs = sum(1 for doc in corpus.fileids(categories=[category]))
    total_docs += num_docs

    print("{}: {:,} documents".format(category, num_docs))

print("\n{:,} documents in the corpus".format(total_docs))
Our corpus reader object handles text preprocessing with NLTK (the Natural Language Toolkit), converting each document by removing stopwords and punctuation, lowercasing every token, and lemmatizing the remaining tokens using their part-of-speech tags.
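To make the conversion concrete, here is a tiny, hypothetical document in the nested paragraph/sentence/(token, tag) structure that the pickles store, run through the reader's tokenize() method:
example_doc = [    # a document is a list of paragraphs...
    [              # ...each paragraph is a list of sentences...
        [("The", "DT"), ("cats", "NNS"), ("were", "VBD"), ("running", "VBG"), (".", ".")]
    ]              # ...and each sentence is a list of (token, tag) pairs.
]

print(corpus.tokenize(example_doc))
# Expected to print something like: ['cat', 'run']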
Here is an example document:
In [7]:
fid = random.choice(corpus.fileids())
doc = next(corpus.docs(fileids=[fid]))
print(" ".join(doc))
The next step is to convert these documents into vectors so that we can apply machine learning. We'll use a bag-of-words (bow) model with TF-IDF, implemented by the Gensim library.
In [8]:
# Create the lexicon from the corpus
lexicon = gensim.corpora.Dictionary(corpus.docs())
# Create the document vectors
docvecs = [lexicon.doc2bow(doc) for doc in corpus.docs()]
# Train the TF-IDF model and convert vectors to TF-IDF
tfidf = gensim.models.TfidfModel(docvecs, id2word=lexicon, normalize=True)
tfidfvecs = [tfidf[doc] for doc in docvecs]
# Save the lexicon and TF-IDF model to disk.
lexicon.save('data/topics/lexicon.dat')
tfidf.save('data/topics/tfidf_model.pkl')
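Saving these to disk means the fitted lexicon and TF-IDF model can be reloaded later without re-reading the corpus; a quick sketch of reloading them (same paths as above):
lexicon = gensim.corpora.Dictionary.load('data/topics/lexicon.dat')
tfidf = gensim.models.TfidfModel.load('data/topics/tfidf_model.pkl')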
Documents are now described by the words that are most important to that document relative to the rest of the corpus. The document above has been transformed into the following vector with associated weights:
In [9]:
# Convert the random document from above into a TF-IDF vector.
dv = tfidf[lexicon.doc2bow(doc)]

# Print the document terms and their weights.
print(" ".join([
    "{} ({:0.2f})".format(lexicon[tid], score)
    for tid, score in sorted(dv, key=itemgetter(1), reverse=True)
]))
In [10]:
# Select the number of topics to train the model on.
NUM_TOPICS = 10
# Create the LDA model from the docvecs corpus and save to disk.
model = gensim.models.LdaModel(docvecs, id2word=lexicon, alpha='auto', num_topics=NUM_TOPICS)
model.save('data/topics/lda_model.pkl')
Each topic is represented as a vector in which each word in the lexicon is a dimension and the value is the probability of that word belonging to the topic. We can use the model to query the topics for a document; our random document from above is assigned the following topics with associated probabilities:
In [11]:
model[lexicon.doc2bow(doc)]
Out[11]:
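The result is a list of (topic id, probability) pairs for the topics whose probability exceeds Gensim's minimum threshold, e.g. something of the form [(3, 0.64), (7, 0.22)]; the exact topic ids and numbers will vary from run to run.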
We can assign the most probable topic to each document in our corpus by selecting the topic with the maximal probability:
In [12]:
topics = [
    max(model[doc], key=itemgetter(1))[0]
    for doc in docvecs
]
Topics themselves can be described by their highest probability words:
In [13]:
for tid, topic in model.print_topics():
    print("Topic {}:\n{}\n".format(tid, topic))
We can plot each topic by using decomposition methods (TruncatedSVD in this case) to reduce the probability vector for each topic into 2 dimensions, then size the radius of each topic according to how much probability the documents in the corpus contribute to it. Also try this with PCA, which is explored below!
In [14]:
# Create a sum dictionary that adds up the total probability
# contributed by each document in the corpus to each topic.
tsize = defaultdict(float)
for doc in docvecs:
    for tid, prob in model[doc]:
        tsize[tid] += prob
In [15]:
# Create a numpy array of topic vectors where each vector
# is the topic probability of all terms in the lexicon.
tvecs = np.array([
    sparse2full(model.get_topic_terms(tid, len(lexicon)), len(lexicon))
    for tid in range(NUM_TOPICS)
])
In [16]:
# Import the model family
from sklearn.decomposition import TruncatedSVD
# Instantiate the model form, fit and transform
topic_svd = TruncatedSVD(n_components=2)
svd_tvecs = topic_svd.fit_transform(tvecs)
In [17]:
# Create the Bokeh columnar data source with our various elements.
# Note the rescaling/normalization of the topic sizes so the radius of
# each topic circle fits in the graph a bit better.
tsource = ColumnDataSource(
    data=dict(
        x=svd_tvecs[:, 0],
        y=svd_tvecs[:, 1],
        w=[model.print_topic(tid, 10) for tid in range(10)],
        c=brewer['Spectral'][10],
        r=[tsize[idx] / 700000.0 for idx in range(10)],
    )
)

# Create the hover tool so that we can visualize the topics.
hover = HoverTool(
    tooltips=[
        ("Words", "@w"),
    ]
)

# Create the figure to draw the graph on.
plt = figure(
    title="Topic Model Decomposition",
    width=960, height=540,
    tools="pan,box_zoom,reset,resize,save"
)

# Add the hover tool.
plt.add_tools(hover)

# Plot the SVD topic dimensions as a scatter plot.
plt.scatter(
    'x', 'y', source=tsource, size=9,
    radius='r', line_color='c', fill_color='c',
    marker='circle', fill_alpha=0.85,
)

# Show the plot to render the JavaScript.
show(plt)
The bag of words model means that every token (string representation of a word) is a dimension and a document is represented by a vector that maps the relative weight of that dimension to the document by the TF-IDF metric. In order to visualize documents in this high dimensional space, we must use decomposition methods to reduce the dimensionality to something we can plot.
One good first attempt is to use principal component analysis (PCA) to reduce the data set dimensions (the number of vocabulary words in the corpus) to 2 dimensions in order to map the corpus as a scatter plot.
We'll use the Scikit-Learn PCA transformer to do this work:
In [18]:
# In order to use Scikit-Learn we need to transform Gensim vectors into a numpy Matrix.
docarr = np.array([sparse2full(vec, len(lexicon)) for vec in tfidfvecs])
In [19]:
# Import the model family
from sklearn.decomposition import PCA
# Instantiate the model form, fit and transform
tfidf_pca = PCA(n_components=2)
pca_dvecs = tfidf_pca.fit_transform(docarr)
We can now use Bokeh to create an interactive plot that will allow us to explore documents according to their position in decomposed TF-IDF space, coloring by their topic.
In [20]:
# Create a map using the ColorBrewer 'Paired' Palette to assign
# Topic IDs to specific colors.
cmap = {
    i: brewer['Paired'][10][i]
    for i in range(10)
}

# Create a tokens listing for our hover tool.
tokens = [
    " ".join([
        lexicon[tid] for tid, _ in sorted(doc, key=itemgetter(1), reverse=True)
    ][:10])
    for doc in tfidfvecs
]

# Create a Bokeh tabular data source to describe the data we've created.
source = ColumnDataSource(
    data=dict(
        x=pca_dvecs[:, 0],
        y=pca_dvecs[:, 1],
        w=tokens,
        t=topics,
        c=[cmap[t] for t in topics],
    )
)

# Create an interactive hover tool so that we can see the document.
hover = HoverTool(
    tooltips=[
        ("Words", "@w"),
        ("Topic", "@t"),
    ]
)

# Create the figure to draw the graph on.
plt = figure(
    title="PCA Decomposition of BoW Space",
    width=960, height=540,
    tools="pan,box_zoom,reset,resize,save"
)

# Add the hover tool to the figure.
plt.add_tools(hover)

# Create the scatter plot with the PCA dimensions as the points.
plt.scatter(
    'x', 'y', source=source, size=9,
    marker='circle_x', line_color='c',
    fill_color='c', fill_alpha=0.5,
)

# Show the plot to render the JavaScript.
show(plt)
Another approach is to use the TSNE model (t-distributed stochastic neighbor embedding). This is a very popular visualization/projection mechanism for text clustering.
In [25]:
# Import the TSNE model family from the manifold package
from sklearn.manifold import TSNE
from sklearn.pipeline import Pipeline
# Instantiate the model form. It is usually recommended
# to apply PCA (for dense data) or TruncatedSVD (for sparse data)
# before TSNE to reduce noise and improve performance.
tsne = Pipeline([
    ('svd', TruncatedSVD(n_components=75)),
    ('tsne', TSNE(n_components=2)),
])

# Transform our TF-IDF vectors.
tsne_dvecs = tsne.fit_transform(docarr)
In [26]:
# Create a map using the ColorBrewer 'Paired' Palette to assign
# Topic IDs to specific colors.
cmap = {
    i: brewer['Paired'][10][i]
    for i in range(10)
}

# Create a tokens listing for our hover tool.
tokens = [
    " ".join([
        lexicon[tid] for tid, _ in sorted(doc, key=itemgetter(1), reverse=True)
    ][:10])
    for doc in tfidfvecs
]

# Create a Bokeh tabular data source to describe the data we've created.
source = ColumnDataSource(
    data=dict(
        x=tsne_dvecs[:, 0],
        y=tsne_dvecs[:, 1],
        w=tokens,
        t=topics,
        c=[cmap[t] for t in topics],
    )
)

# Create an interactive hover tool so that we can see the document.
hover = HoverTool(
    tooltips=[
        ("Words", "@w"),
        ("Topic", "@t"),
    ]
)

# Create the figure to draw the graph on.
plt = figure(
    title="TSNE Decomposition of BoW Space",
    width=960, height=540,
    tools="pan,box_zoom,reset,resize,save"
)

# Add the hover tool to the figure.
plt.add_tools(hover)

# Create the scatter plot with the t-SNE dimensions as the points.
plt.scatter(
    'x', 'y', source=source, size=9,
    marker='circle_x', line_color='c',
    fill_color='c', fill_alpha=0.5,
)

# Show the plot to render the JavaScript.
show(plt)